
Yet another incident caused by not following Elasticsearch's official recommendations

A quick write-up of the analysis, troubleshooting, and handling of an Elasticsearch cluster incident.

Background

In the morning, a colleague pulled us into a group chat to report that an ES cluster they were using was seeing write timeouts; a look at the cluster status showed it was already RED.

  • This is a 3-node cluster of 8-core / 32 GB machines (JVM heap set to 16 GB)
  • At the time of the incident the cluster held 5,000 indices, each with 6 shards and 1 replica, i.e. 12 shards per index
    • That put the shard count at about 60,000, not counting the ES system indices (a quick way to confirm these numbers is sketched right after this list)
  • The cluster instances had been restarted
  • Total data volume was about 480 GB
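The shard count and heap figures above can be confirmed from the cluster itself; a minimal check (the filter_path and column choices below are just one convenient way to trim the output):

GET /_cluster/health?filter_path=status,number_of_nodes,active_shards,unassigned_shards

GET /_cat/nodes?v&h=name,node.role,heap.max,ram.max

Roughly 60,000 shards spread over three 16 GB heaps is about 20,000 shards per node, far beyond the often-cited 7.x sizing guidance of keeping shard counts on the order of 20 shards per GB of heap.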

Does any of this feel familiar?
=> An earlier incident caused by not following Elasticsearch's official recommendations

We immediately contacted the business team about deleting historical data; they confirmed that the 2022 and 2023 data could be dropped.
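With date-suffixed index names, that cleanup boils down to wildcard deletes along these lines (a sketch only: the notification-* patterns are assumed from the index naming that shows up later in the logs, not the exact commands run; wildcard deletes also require action.destructive_requires_name to be false, which is the 7.x default):

# illustrative patterns, assumed from the naming convention seen later in this post
DELETE /notification-*-2022.*
DELETE /notification-*-2023.*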

Handling

Speeding up recovery

After the business team ran the bulk deletes, I first raised the cluster's recovery-related settings:

PUT /_cluster/settings
{
"persistent": {
"cluster.routing.allocation.node_concurrent_recoveries": 200,
"indices.recovery.max_bytes_per_sec": "80mb"
}
}

The recovery-speed settings are explained here: https://elastic.ac.cn/guide/en/elasticsearch/reference/8.17/recovery.html
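These values are deliberately aggressive and only meant for the duration of the recovery; once the cluster is green again they can be restored to defaults by setting them back to null, e.g.:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}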

While the cluster was recovering, the ES instances kept emitting the following WARN log:

Unable to acquire permit to use snapshot files during recovery, 
this recovery will recover index files from the source node.
Ensure snapshot files can be used during recovery by setting
[indices.recovery.max_concurrent_snapshot_file_downloads] to be no greater than [25]

A look at the cluster settings showed:

GET /_cluster/settings?include_defaults=true

{
"recovery": {
"recovery_activity_timeout": "1800000ms",
"retry_delay_network": "5s",
"internal_action_timeout": "15m",
"max_concurrent_snapshot_file_downloads_per_node": "25",
"retry_delay_state_sync": "500ms",
"max_concurrent_snapshot_file_downloads": "5",// 这里并没有比25大
"internal_action_long_timeout": "1800000ms",
"max_concurrent_operations": "1",
"use_snapshots": "true",
"max_concurrent_file_chunks": "2"
}
}

The per-recovery value (5) is indeed not greater than 25; the warning appears because, with node_concurrent_recoveries raised to 200, the node-wide pool of 25 snapshot-download permits (max_concurrent_snapshot_file_downloads_per_node) runs dry and the affected recoveries simply fall back to copying files from the source node. Since that is a slower path rather than a failure, I ignored it for the moment; once recovery finished, the log stopped appearing.
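Had the warning needed silencing, one option (a sketch, not something done here) would have been to turn off snapshot-based recovery for the duration, so every recovery copies from a peer node instead of competing for download permits:

PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.use_snapshots": false
  }
}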

Troubleshooting

I picked one ES node and downloaded the last few days of logs to find out why the instance had exited.

Before the instance exited, there was a flood of exceptions like this:

this may be caused by lack of enough unfragmented virtual address space 
or too restrictive virtual memory limits enforced by the operating system,
preventing us to map a chunk of 414627 bytes. Please review 'ulimit -v', 'ulimit -m' (both should return 'unlimited'),
and 'sysctl vm.max_map_count'. More information:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

A check showed max_map_count (and the ulimits) looked fine, both on the host and inside the container:

[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# ulimit -m
unlimited
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# ulimit -v
unlimited
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]#
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# sysctl vm.max_map_count
vm.max_map_count = 262144
[root@prd-jiankunking-notify-es-3: /app/es/data/nodes/0]# docker exec -it 21871c061981 /bin/bash
root@prd-jiankunking-notify-es-3:/usr/share/elasticsearch# ulimit -m
unlimited
root@prd-jiankunking-notify-es-3:/usr/share/elasticsearch# ulimit -v
unlimited
root@prd-jiankunking-notify-es-3:/usr/share/elasticsearch# sysctl vm.max_map_count
vm.max_map_count = 262144

262144 is the minimum value Elasticsearch 7.16.2 asks for at startup.
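If the limit ever did need raising, it would have to be done on the host (the container inherits it); a sketch with an illustrative value:

# raise the kernel limit on the host and persist it across reboots (the value is illustrative)
sysctl -w vm.max_map_count=1048576
echo 'vm.max_map_count=1048576' >> /etc/sysctl.conf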

Next, the log at the moment the instance exited:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fe4d805c000, 65536, 1) failed; error='Not enough space' (errno=12)

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
[thread 330 also had an error][thread 232 also had an error]
# An error report file with more information is saved as
# logs/hs_err_pid7.log

Then the contents of logs/hs_err_pid7.log:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 65536 bytes for committing reserved memory.
# Possible reasons:
# The system is out of physical RAM or swap space
# The process is running with CompressedOops enabled, and the Java Heap may be blocking the growth of the native heap
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# JVM is running with Zero Based Compressed Oops mode in which the Java heap is
# placed in the first 32GB address space. The Java Heap base address is the
# maximum limit for the native heap growth. Please use -XX:HeapBaseMinAddress
# to set the Java Heap base and to place the Java Heap above 32GB virtual address.
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2728), pid=7, tid=213
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.1+12 (17.0.1+12) (build 17.0.1+12)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12 (17.0.1+12, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Core dump will be written. Default location: /usr/share/elasticsearch/core.7
#

--------------- S U M M A R Y ------------

Command Line: -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -Xms16g -Xmx16g -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-18266345477890424435 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -XX:MaxDirectMemorySize=8589934592 -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=25 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=default -Des.distribution.type=docker -Des.bundled_jdk=true org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true

Host: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz, 8 cores, 31G, Ubuntu 20.04.3 LTS
Time: Wed Jan 8 00:44:25 2025 UTC elapsed time: 28895.244676 seconds (0d 8h 1m 35s)

Conclusion

At this point it is clear that the instance exited because a native memory allocation (mmap) requested by ES could not be satisfied.
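errno=12 (ENOMEM) on an mmap call can mean either that physical RAM/swap is exhausted or that the process has run into vm.max_map_count; with ~20,000 shards per node, each keeping several mmap-ed Lucene files open, either is plausible. A quick comparison on a live node (a sketch; the pgrep pattern is illustrative):

# memory-mapped regions held by the ES process vs. the kernel limit
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
wc -l /proc/${ES_PID}/maps
sysctl vm.max_map_count

# and how much physical memory/swap is left on the host
free -h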

After the data deletion, via

GET /_cat/health?v

I watched active_shards_percent climb back to 100%, yet 19 shards remained unassigned.
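To list exactly which shards those were (a minimal sketch with the cat shards API):

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state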

To see why they were not being assigned:

GET /_cluster/allocation/explain

The response:

{
"reason" : "ALLOCATION_FAILED",
"at" : "2025-01-07T16:28:22.026Z",
"failed_allocation_attempts" : 5,
"details" : """failed shard on node [IYowNoBkTzq_WPjXgb_10Q]:
failed recovery, failure RecoveryFailedException[[notification-jiankunking-2025.01][3]:
Recovery failed on {es-notify-203}{IYowNoBkTzq_WPjXgb_10Q}
{nzTGfgdBQvykqKacU_cyzA}{127.0.0.203}{127.0.0.203:9301}{cdfhilmrstw}{ml.machine_memory=33568223232, xpack.installed=true,
transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=17179869184}];
nested: IndexShardRecoveryException[failed to recover from gateway]; nested:
EngineCreationFailureException[failed to open reader on writer]; nested:
IOException[Map failed: MMapIndexInput(path="/usr/share/elasticsearch/data/nodes/0/indices/n5E4PEFpTKKDgGdNmEqSPQ/3/index/_j50_Lucene80_0.dvd")
[this may be caused by lack of enough unfragmented virtual address space or too restrictive virtual memory limits enforced by the operating system,
preventing us to map a chunk of 4321640 bytes. Please review 'ulimit -v', 'ulimit -m' (both should return 'unlimited'),
and 'sysctl vm.max_map_count'. More information:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html]]; """,
"last_allocation_status" : "no"
}

A manual retry of the failed allocations:

POST /_cluster/reroute?retry_failed=true

The cluster status went red -> yellow -> green.

After the business team cleared out the historical data, the cluster ended up with:

  • shard count down to about 40,000
  • data volume down to 335.3 GB
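The longer-term fix is on the index side rather than the cluster side: at roughly 480 GB across 5,000 indices, each index holds on the order of 100 MB, so a single primary shard per index is more than enough. A sketch of an index template along those lines (the template name and index pattern are assumptions based on the index naming seen above):

PUT /_index_template/notification-template
{
  "index_patterns": ["notification-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}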
